NSF PAR Search | NSF Public Access Repository


SuperBPE: Space Travel for Language Models

Liu, Alisa; Hayase, Jonathan; Hofmann, Valentin; Oh, Sewoong; Smith, Noah A; Choi, Yejin (October 2025, Conference on Language Modeling)

Full Text Available
Data mixture inference attack: BPE tokenizers reveal training data compositions

Hayase, Jonathan; Liu, Alisa; Choi, Yejin; Oh, Sewoong; Smith, Noah A (December 2024, Advances in Neural Information Processing Systems)

The pretraining data of today's strongest language models remains opaque, even when their parameters are open-sourced. In particular, little is known about the proportions of different domains, languages, or code represented in the data. While a long line of membership inference attacks aim to identify training examples on an instance level, they do not extend easily to global statistics about the corpus. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of the pretraining data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered vocabulary learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first token is the most common byte pair, the second is the most common pair after merging the first token, and so on. Given a tokenizer's merge list along with data samples for each category of interest (e.g., different natural languages), we formulate a linear program that solves for the relative proportion of each category in the tokenizer's training set. Importantly, to the extent that tokenizer training data is representative of the pretraining data, we indirectly learn about the pretraining data. In controlled experiments, we show that our attack can recover mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released alongside recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o is much more multilingual than its predecessors, training on 10x more non-English data than GPT-3.5; Llama 3 and Claude are trained on predominantly code; and many recent models are trained on 7-16% books. We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
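
The key insight in this abstract, that the order of BPE merges reflects pair frequencies in the training data, can be illustrated with a toy trainer. This is only a minimal sketch under stated assumptions, not the authors' implementation: it operates on characters within whitespace-free words rather than on bytes, the function name `train_bpe` and the two one-word "categories" are hypothetical, ties are broken alphabetically for determinism, and the linear program described in the abstract is not shown.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn an ordered merge list from a toy corpus (hypothetical helper)."""
    # Each word becomes a tuple of single-character symbols, stored with its count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken alphabetically (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Apply the merge to every word before counting pairs again.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


# Two toy "categories": English-like text ("the") and code-like text ("def").
mostly_english = ["the"] * 3 + ["def"]
mostly_code = ["the"] + ["def"] * 3

# The first learned merge differs depending on which category dominates the mix.
print(train_bpe(mostly_english, 2))
print(train_bpe(mostly_code, 2))
```

Because each merge is the most frequent pair at the moment it is chosen, the merge list imposes ordering constraints on pair counts in the training mix, which is the kind of information the abstract says the linear program exploits.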
Full Text Available
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Hayase, Jonathan; Liu, Alisa; Choi, Yejin; Oh, Sewoong; Smith, Noah A (November 2024, https://doi.org/10.48550/arXiv.2407.16607)

The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data. Given a tokenizer's merge list along with example data for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o and Mistral NeMo's tokenizers are much more multilingual than their predecessors, training on 39% and 47% non-English language data, respectively; Llama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
Full Text Available
Advancing science- and evidence-based AI policy

https://doi.org/10.1126/science.adu8449

Bommasani, Rishi; Arora, Sanjeev; Chayes, Jennifer; Choi, Yejin; Cuéllar, Mariano-Florentino; Fei-Fei, Li; Ho, Daniel E; Jurafsky, Dan; Koyejo, Sanmi; Lakkaraju, Hima; et al (July 2025, Science)

Policy must be informed by, but also facilitate the generation of, scientific evidence
Full Text Available
Modular Pluralism: Pluralistic Alignment via Multi-LLM Collaboration

Feng, Shangbin; Sorensen, Taylor; Liu, Yuhan; Fisher, Jillian; Park, Chan Young; Choi, Yejin; Tsvetkov, Yulia (December 2024, EMNLP)

Full Text Available
Tuning Language Models by Proxy

Liu, Alisa; Han, Xiaochuang; Wang, Yizhong; Tsvetkov, Yulia; Choi, Yejin; Smith, Noah A (October 2024, COLM)

Full Text Available
Tuning Language Models by Proxy

Liu, Alisa; Han, Xiaochuang; Wang, Yizhong; Tsvetkov, Yulia; Choi, Yejin; Smith, Noah A (October 2024, Conference on Language Modeling)

Full Text Available
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Sclar, Melanie; Choi, Yejin; Tsvetkov, Yulia; Suhr, Alane (May 2024, International Conference on Learning Representations)

As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.
Full Text Available
Do Membership Inference Attacks Work on Large Language Models?

Duan, Michael; Suri, Anshuman; Mireshghallah, Niloofar; Min, Sewon; Shi, Weijia; Zettlemoyer, Luke; Tsvetkov, Yulia; Choi, Yejin; Evans, David; Hajishirzi, Hannaneh (October 2024, COLM)

Full Text Available
Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory

Mireshghallah, Niloofar; Kim, Hyunwoo; Zhou, Xuhui; Tsvetkov, Yulia; Sap, Maarten; Shokri, Reza; Choi, Yejin (May 2024, International Conference on Learning Representations)

Existing efforts on quantifying privacy implications for large language models (LLMs) solely focus on measuring leakage of training data. In this work, we shed light on the often-overlooked interactive settings where an LLM receives information from multiple sources and generates an output to be shared with other entities, creating the potential of exposing sensitive input data in inappropriate contexts. In these scenarios, humans naturally uphold privacy by choosing whether or not to disclose information depending on the context. We ask the question “Can LLMs demonstrate an equivalent discernment and reasoning capability when considering privacy in context?” We propose CONFAIDE, a benchmark grounded in the theory of contextual integrity and designed to identify critical weaknesses in the privacy reasoning capabilities of instruction-tuned LLMs. CONFAIDE consists of four tiers, gradually increasing in complexity, with the final tier evaluating contextual privacy reasoning and theory of mind capabilities. Our experiments show that even commercial models such as GPT-4 and ChatGPT reveal private information in contexts that humans would not, 39% and 57% of the time, respectively, highlighting the urgent need for a new direction of privacy-preserving approaches as we demonstrate a larger underlying problem stemming from the models’ lack of reasoning capabilities.
Full Text Available
